VIRDO++: Real-World, Visuo-tactile Dynamics and Perception of Deformable Objects
Deformable object manipulation can benefit from representations that
seamlessly integrate vision and touch while handling occlusions. In this work,
we present a novel approach for, and real-world demonstration of, multimodal
visuo-tactile state-estimation and dynamics prediction for deformable objects.
Our approach, VIRDO++, builds on recent progress in multimodal neural implicit
representations for deformable object state-estimation [1] via a new
formulation for deformation dynamics and a complementary state-estimation
algorithm that (i) maintains a belief over deformations, and (ii) enables
practical real-world application by removing the need for privileged contact
information. In the context of two real-world robotic tasks, we show: (i)
high-fidelity cross-modal state-estimation and prediction of deformable objects
from partial visuo-tactile feedback, and (ii) generalization to unseen objects
and contact formations.
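The abstract above describes a multimodal neural implicit representation of deformable object state. As a hedged illustration only, the sketch below shows one plausible shape such a representation could take, assuming a latent deformation code inferred upstream from visuo-tactile observations; the class name, layer sizes, and signed-distance output are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class ImplicitDeformationField(nn.Module):
    """Illustrative neural implicit representation of a deformable object:
    maps a 3D query point plus a latent deformation code (assumed to be
    inferred from visuo-tactile observations) to a signed distance value."""

    def __init__(self, code_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) query locations; code: (code_dim,) deformation latent.
        code = code.expand(points.shape[0], -1)
        return self.net(torch.cat([points, code], dim=-1))  # (N, 1) signed distances

# Usage (illustrative): field = ImplicitDeformationField(); sdf = field(torch.rand(1024, 3), torch.zeros(32))
```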
NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis
Expert demonstrations are a rich source of supervision for training visual
robotic manipulation policies, but imitation learning methods often require
either a large number of demonstrations or expensive online expert supervision
to learn reactive closed-loop behaviors. In this work, we introduce SPARTN
(Synthetic Perturbations for Augmenting Robot Trajectories via NeRF): a
fully-offline data augmentation scheme for improving robot policies that use
eye-in-hand cameras. Our approach leverages neural radiance fields (NeRFs) to
synthetically inject corrective noise into visual demonstrations, using NeRFs
to generate perturbed viewpoints while simultaneously calculating the
corrective actions. This requires no additional expert supervision or
environment interaction, and distills the geometric information in NeRFs into a
real-time reactive RGB-only policy. In a simulated 6-DoF visual grasping
benchmark, SPARTN improves success rates by 2.8x over imitation learning
without the corrective augmentations and even outperforms some methods that use
online supervision. It additionally closes the gap between RGB-only and RGB-D
success rates, eliminating the previous need for depth sensors. In real-world
6-DoF robotic grasping experiments from limited human demonstrations, our
method improves absolute success rates on average, including objects that are
traditionally challenging for depth-based methods. See video results at
https://bland.website/spartn.
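The corrective-augmentation idea lends itself to a short sketch: perturb a demonstrated eye-in-hand viewpoint, render the perturbed view with a NeRF fit to that demonstration, and label the rendered image with an action that undoes the perturbation before following the expert. The `nerf_render` callable, the additive pose composition, and the noise scales below are simplifying assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def random_se3_perturbation(trans_std=0.01, rot_std=0.05):
    """Sample a small random 6-DoF perturbation: translation (meters) + axis-angle rotation."""
    return np.concatenate([np.random.normal(0.0, trans_std, 3),
                           np.random.normal(0.0, rot_std, 3)])

def augment_step(nerf_render, demo_pose, demo_action):
    """Build one synthetic (image, corrective action) pair from a demonstration step.

    nerf_render: callable mapping a 6-DoF camera pose to an RGB image (hypothetical
                 interface to a NeRF fit on the demonstration's frames).
    demo_pose:   eye-in-hand camera pose at this step, as a 6-vector.
    demo_action: expert relative action taken from demo_pose, as a 6-vector.
    """
    delta = random_se3_perturbation()
    perturbed_pose = demo_pose + delta      # additive composition: a small-perturbation shortcut
    image = nerf_render(perturbed_pose)     # synthesize the view the camera would have seen
    corrective = demo_action - delta        # first undo the perturbation, then follow the expert
    return image, corrective
```

The key property is that the corrective labels come for free from the known perturbation, so no extra expert supervision or environment interaction is needed.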
Learning to Rearrange Deformable Cables, Fabrics, and Bags with Goal-Conditioned Transporter Networks
Rearranging and manipulating deformable objects such as cables, fabrics, and
bags is a long-standing challenge in robotic manipulation. The complex dynamics
and high-dimensional configuration spaces of deformables, compared to rigid
objects, make manipulation difficult not only for multi-step planning, but even
for goal specification. Goals cannot be as easily specified as rigid object
poses, and may involve complex relative spatial relations such as "place the
item inside the bag". In this work, we develop a suite of simulated benchmarks
with 1D, 2D, and 3D deformable structures, including tasks that involve
image-based goal-conditioning and multi-step deformable manipulation. We
propose embedding goal-conditioning into Transporter Networks, a recently
proposed model architecture for learning robotic manipulation that rearranges
deep features to infer displacements that can represent pick and place actions.
We demonstrate that goal-conditioned Transporter Networks enable agents to
manipulate deformable structures into flexibly specified configurations without
test-time visual anchors for target locations. We also significantly extend
prior results using Transporter Networks for manipulating deformable objects by
testing on tasks with 2D and 3D deformables. Supplementary material is
available at https://berkeleyautomation.github.io/bags/.
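As a rough illustration of goal-conditioning in a Transporter-style model, the sketch below concatenates the goal image with the observation along the channel dimension, picks the highest-scoring pick pixel, and cross-correlates features cropped around that pick against the full feature map to score placements. The `encoder` interface, the stand-in pick head, and the crop size are assumptions for illustration, not the authors' architecture (which also reasons over rotations).

```python
import torch
import torch.nn.functional as F

def goal_conditioned_pick_place(obs, goal, encoder, crop=64):
    """Toy goal-conditioned pick-and-place inference in the spirit of Transporter Networks.

    obs, goal: top-down RGB images, shape (1, 3, H, W) with H, W >= crop.
    encoder:   any fully convolutional network taking 6 channels and returning (1, C, H, W).
    """
    feat = encoder(torch.cat([obs, goal], dim=1))   # goal-conditioning via channel concatenation
    pick_logits = feat.mean(dim=1)                  # (1, H, W) stand-in pick heatmap
    H, W = pick_logits.shape[-2:]
    py, px = divmod(int(pick_logits.flatten().argmax()), W)

    # Crop features around the chosen pick (clamped to stay inside the image) and use
    # them as a correlation kernel over the whole feature map to score placements.
    half = crop // 2
    py, px = min(max(py, half), H - half), min(max(px, half), W - half)
    kernel = feat[:, :, py - half:py + half, px - half:px + half]
    place_logits = F.conv2d(feat, kernel, padding=half)
    qy, qx = divmod(int(place_logits.flatten().argmax()), place_logits.shape[-1])
    return (py, px), (qy, qx)                        # pick pixel, place pixel
```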
iNeRF: Inverting Neural Radiance Fields for Pose Estimation
We present iNeRF, a framework that performs mesh-free pose estimation by
"inverting" a Neural RadianceField (NeRF). NeRFs have been shown to be
remarkably effective for the task of view synthesis - synthesizing
photorealistic novel views of real-world scenes or objects. In this work, we
investigate whether we can apply analysis-by-synthesis via NeRF for mesh-free,
RGB-only 6DoF pose estimation - given an image, find the translation and
rotation of a camera relative to a 3D object or scene. Our method assumes that
no object mesh models are available during either training or test time.
Starting from an initial pose estimate, we use gradient descent to minimize the
residual between pixels rendered from a NeRF and pixels in an observed image.
In our experiments, we first study 1) how to sample rays during pose refinement
for iNeRF to collect informative gradients and 2) how different batch sizes of
rays affect iNeRF on a synthetic dataset. We then show that for complex
real-world scenes from the LLFF dataset, iNeRF can improve NeRF by estimating
the camera poses of novel images and using these images as additional training
data for NeRF. Finally, we show iNeRF can perform category-level object pose
estimation, including object instances not seen during training, with RGB
images by inverting a NeRF model inferred from a single view. Project website: http://yenchenlin.me/inerf
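The analysis-by-synthesis loop described above reduces to gradient descent on a photometric residual. A minimal sketch, assuming a differentiable `render_pixels` wrapper around a trained NeRF (hypothetical interface) and a 6-vector pose parameterization; the sampling strategy and batch size are illustrative choices.

```python
import torch

def pose_refinement(render_pixels, observed, init_pose, steps=300, lr=1e-2, batch=1024):
    """Refine a camera pose by minimizing the residual between rendered and observed pixels.

    render_pixels: differentiable callable (pose_6vec, pixel_indices) -> predicted RGB (B, 3),
                   assumed to wrap a trained NeRF (hypothetical interface).
    observed:      observed RGB image as a tensor of shape (H, W, 3).
    init_pose:     initial pose guess as a 6-vector (translation + axis-angle rotation).
    """
    pose = init_pose.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)
    targets = observed.reshape(-1, 3)
    num_pixels = targets.shape[0]
    for _ in range(steps):
        # Random ray/pixel batch; interest-region sampling would slot in here.
        idx = torch.randint(0, num_pixels, (batch,))
        pred = render_pixels(pose, idx)                # render only the sampled pixels
        loss = ((pred - targets[idx]) ** 2).mean()     # photometric residual
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return pose.detach()
```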
Code as Policies: Language Model Programs for Embodied Control
Large language models (LLMs) trained on code completion have been shown to be
capable of synthesizing simple Python programs from docstrings [1]. We find
that these code-writing LLMs can be re-purposed to write robot policy code,
given natural language commands. Specifically, policy code can express
functions or feedback loops that process perception outputs (e.g., from object
detectors [2], [3]) and parameterize control primitive APIs. When provided as
input several example language commands (formatted as comments) followed by
corresponding policy code (via few-shot prompting), LLMs can take in new
commands and autonomously re-compose API calls to generate new policy code
accordingly. By chaining classic logic structures and referencing third-party
libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way
can write robot policies that (i) exhibit spatial-geometric reasoning, (ii)
generalize to new instructions, and (iii) prescribe precise values (e.g.,
velocities) to ambiguous descriptions ("faster") depending on context (i.e.,
behavioral commonsense). This paper presents code as policies: a robot-centric
formalization of language model generated programs (LMPs) that can represent
reactive policies (e.g., impedance controllers), as well as waypoint-based
policies (vision-based pick and place, trajectory-based control), demonstrated
across multiple real robot platforms. Central to our approach is prompting
hierarchical code-gen (recursively defining undefined functions), which can
write more complex code and also improves state-of-the-art to solve 39.8% of
problems on the HumanEval [1] benchmark. Code and videos are available at
https://code-as-policies.github.io
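As a hedged sketch of the few-shot prompting pattern described above, the snippet below pairs language commands (as comments) with policy code and asks a code-completion LLM to continue the pattern for a new command. The perception/control primitives (`get_obj_pos`, `move_to`, `pick_place`), the example completions, and the `llm_complete` wrapper are hypothetical stand-ins, not the paper's actual prompt or API.

```python
# Few-shot prompt: language commands as comments, each followed by policy code that
# calls (hypothetical) perception and control primitives.
FEW_SHOT_PROMPT = '''
# move the gripper 10 cm to the right of the red block.
pos = get_obj_pos("red block")
move_to((pos[0] + 0.10, pos[1], pos[2]))

# stack the green block on the blue block.
pick_place("green block", "blue block")
'''

def generate_policy_code(llm_complete, command: str) -> str:
    """llm_complete: callable wrapping any code-completion LLM (assumed interface)."""
    prompt = FEW_SHOT_PROMPT + "\n# " + command + "\n"
    return llm_complete(prompt)   # returns Python source that re-composes the primitive APIs

# Usage (illustrative): generate_policy_code(llm, "put the sponge a bit faster into the bin")
```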
Large Language Models as General Pattern Machines
We observe that pre-trained large language models (LLMs) are capable of
autoregressively completing complex token sequences -- from arbitrary ones
procedurally generated by probabilistic context-free grammars (PCFG), to richer
spatial patterns found in the Abstraction and Reasoning Corpus (ARC), a
general AI benchmark, prompted in the style of ASCII art. Surprisingly, pattern
completion proficiency can be partially retained even when the sequences are
expressed using tokens randomly sampled from the vocabulary. These results
suggest that without any additional training, LLMs can serve as general
sequence modelers, driven by in-context learning. In this work, we investigate
how these zero-shot capabilities may be applied to problems in robotics -- from
extrapolating sequences of numbers that represent states over time to complete
simple motions, to least-to-most prompting of reward-conditioned trajectories
that can discover and represent closed-loop policies (e.g., a stabilizing
controller for CartPole). While difficult to deploy today for real systems due
to latency, context size limitations, and compute costs, the approach of using
LLMs to drive low-level control may provide an exciting glimpse into how the
patterns among words could be transferred to actions.
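As an illustration of the state-extrapolation use case mentioned above, the sketch below serializes a short numeric state trajectory as plain text and asks a language model to continue it, parsing the continuation back into states. The `llm_complete` wrapper, the fixed-point formatting, and the token budget are assumptions for illustration, not the paper's setup.

```python
def extrapolate_states(llm_complete, states, horizon=5):
    """Use an LLM as a zero-shot sequence completer for numeric state trajectories.

    llm_complete: callable (prompt, max_tokens) -> completion string (assumed interface).
    states:       list of fixed-length tuples of floats (e.g., CartPole observations over time).
    """
    # Serialize one time step per line, values separated by spaces.
    prompt = "\n".join(" ".join(f"{v:.2f}" for v in step) for step in states) + "\n"
    completion = llm_complete(prompt, max_tokens=horizon * len(states[0]) * 6)
    # Parse the first `horizon` non-empty lines back into numeric states
    # (assumes the model keeps emitting well-formed rows).
    rows = [line for line in completion.strip().splitlines() if line.strip()][:horizon]
    return [tuple(float(v) for v in row.split()) for row in rows]
```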
RoboPianist: A Benchmark for High-Dimensional Robot Control
We introduce a new benchmarking suite for high-dimensional control, targeted
at testing high spatial and temporal precision, coordination, and planning, all
with an underactuated system frequently making-and-breaking contacts. The
proposed challenge is mastering the piano through bi-manual dexterity, using a
pair of simulated anthropomorphic robot hands. We call it RoboPianist, and the
initial version covers a broad set of 150 variable-difficulty songs. We
investigate both model-free and model-based methods on the benchmark,
characterizing their performance envelopes. We observe that while certain
existing methods, when well-tuned, can achieve impressive levels of performance
in certain aspects, there is significant room for improvement. RoboPianist
provides a rich quantitative benchmarking environment, with human-interpretable
results, high ease of expansion by simply augmenting the repertoire with new
songs, and opportunities for further research, including in multi-task
learning, zero-shot generalization, multimodal (sound, vision, touch) learning,
and imitation. Supplementary information, including videos of our control
policies, can be found at https://kzakka.com/robopianist
MIRA: Mental Imagery for Robotic Affordances
Humans form mental images of 3D scenes to support counterfactual imagination,
planning, and motor control. Our abilities to predict the appearance and
affordance of the scene from previously unobserved viewpoints aid us in
performing manipulation tasks (e.g., 6-DoF kitting) with a level of ease that
is currently out of reach for existing robot learning frameworks. In this work,
we aim to build artificial systems that can analogously plan actions on top of
imagined images. To this end, we introduce Mental Imagery for Robotic
Affordances (MIRA), an action reasoning framework that optimizes actions with
novel-view synthesis and affordance prediction in the loop. Given a set of 2D
RGB images, MIRA builds a consistent 3D scene representation, through which we
synthesize novel orthographic views amenable to pixel-wise affordance
prediction for action optimization. We illustrate how this optimization process
enables us to generalize to unseen out-of-plane rotations for 6-DoF robotic
manipulation tasks given a limited number of demonstrations, paving the way
toward machines that autonomously learn to understand the world around them for
planning actions. Project webpage (CoRL 2022): https://yenchenlin.me/mir
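The optimize-with-imagination loop described above can be sketched as follows: render candidate orthographic views of the learned scene representation, score each pixel with an affordance model, and keep the best (rotation, pixel) pair as the action. `render_ortho`, `predict_affordance`, and the candidate rotation set are hypothetical interfaces used only for illustration, not the authors' implementation.

```python
import numpy as np

def select_action(render_ortho, predict_affordance, candidate_rotations):
    """Pick the (rotation, pixel) action with the highest predicted affordance.

    render_ortho:        callable mapping an out-of-plane rotation to an imagined
                         orthographic view of the scene (assumed interface).
    predict_affordance:  callable mapping a view to an (H, W) pixel-wise success score map.
    candidate_rotations: iterable of rotations to evaluate in the loop.
    """
    best_score, best_action = -np.inf, None
    for rot in candidate_rotations:
        view = render_ortho(rot)                              # imagined novel view
        scores = predict_affordance(view)                     # (H, W) affordance map
        v, u = np.unravel_index(int(np.argmax(scores)), scores.shape)
        if scores[v, u] > best_score:
            best_score, best_action = scores[v, u], (rot, u, v)
    return best_action   # rotation plus pixel, together defining the 6-DoF action
```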